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Abstract. Motivated by applications in neuroanatomy, we propose a novel methodology for 
estimating the heritability which corresponds to the proportion of phenotypic variance which 
can be explained by genetic factors. Estimating this quantity for neuroanatomical features is 
a fundamental challenge in psychiatric disease research. Since the phenotypic variations may 
only be due to a small fraction of the available genetic information, we propose an estimator 
of the heritability that can be used in high dimensional sparse linear mixed models. Our 
method consists of three steps. Firstly, a variable selection stage is performed in order 
to recover the support of the genetic effects - also called causal variants - that is to find 
the genetic effects which really explain the phenotypic variations. Secondly, we propose a 
maximum likelihood strategy for estimating the heritability which only takes into account the 
causal genetic effects found in the first step. Thirdly, we compute the standard error and the 
95% confidence interval associated to our heritability estimator thanks to a nonparametric 
bootstrap approach. Our contribution consists in providing an estimation of the heritability 
with standard errors substantially smaller than methods without variable selection when the 
genetic effects are very sparse. Since the real genetic architecture is in general unknown in 
practice, we also propose an empirical criterion which allows the user to decide whether it is 
relevant to apply a variable selection based approach or not. We illustrate the performance of 
our methodology on synthetic and real neuroanatomic data coming from the Imagen project. 
We also show that our approach has a very low computational burden and is very efficient 
from a statistical point of view. 


1. Introduction 

For many complex traits in human population, there exists a huge gap between the ge¬ 
netic variance explained by population studies and the variance explained by specific variants 
found thanks to genome wide association studies (GWAS). This gap has been called by [9] 
and HQ! the “dark matter” of the genome or the “dark matter” of heritability. Various pop¬ 
ulation studies have shown that up to 80% of the variability of neuroanatomical phenotypes 
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such as the brain volume could be explained by genetic factors, see for instance |17j . This 
result is very important since several psychiatric disorders are shown to be associated to 
neuroanatomical changes, for instance macrocephaly and autism m or reduced hippocam¬ 
pus and schizophrenia [T]. Estimating properly the impact of the genetic background on 
neuroanatomical changes is a crucial challenge in order to determine afterwards if this back¬ 
ground can either be a risk factor or a protective factor from developing psychiatric disorders. 
The GWAS studies performed for instance by m identified genetic variants involved in the 
neuroanatomical diversity, which contributes to understand the impact of genetic factors. 
However, in the course of these studies, it is shown that this approach only explains a small 
proportion of the phenotypic variance. In order to understand the nature of the genetic 
factors responsible for major variations of the brain volume, m used linear mixed models 
(LMM) to consider the effects of all the common genetic diversity characterized by the Single 
Nucleotide Polymorphisms (SNPs). This approach had been suggested by [22] to study the 
effects of the SNPs on the height variations. The model they considered is a LMM defined as 
follows: 


(1) Y = X/3 + Zu + e , 

where Y = (Yi,... ,1^)' is the vector of observations (phenotypes), X is a n x p matrix of 
predictors, /3 is a p x 1 vector containing the unknown linear effects of the predictors, Z is the 
genetic information matrix, u and e correspond to the random effects. More precisely, Z is a 
version of W with centered and normalized columns, where W is defined as follows: Wij = 0 
(resp. 1, resp. 2) if the genotype of the zth individual at locus j is qq (resp. Qq, resp. QQ) 
where pj denotes the frequency of the allele q at locus j. In (j^, the vector e corresponds to 
the environment effects and the vector u corresponds to the genetic random effect, that is the 
j-th component of u is the effect of the j-th. SNP on the phenotype. In the modeling of |22j . 
all the SNPs have an effect on the considered phenotype, that is 

(2) u ~ AA ^0, (T*^Id]Rn^ and e ~ AA ^0, (T*^Id]Rn^ . 

The covariance matrix of Y can thus be written as: 

ZZ' 

~W ’ 


Var(Y) = Y(T*^R -I- (T*^IdRn , where R 
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and the parameter 77 * defined as 

+ af 

is commonly called the heritability mm), and corresponds to the proportion of phenotypic 
variance which is determined by all the SNPs. 

Since all SNPs are not necessarily causal, it seems more realistic to extend the previous 
modeling by assuming that the genetic random effects can be sparse, that is only a proportion 
q of the components of u are non null: 

(4) Ui (1 - g)(5o + qM{0, for all 1 ^ z ^ A^, 

where q is in (0,1], and Sq is the point mass at 0. Then the definition of ry* has to be adjusted 
as follows: 

Nqa*'^ + af ' 

It corresponds to the proportion of phenotypic variance which is due to a certain number of 
causal SNPs which are, obviously, unknown. Let us emphasize that, in most applications, the 
proportion q of causal SNPs is also unknown, and that it may happen that the scientist has 
no idea how small q is. 

When q = 1, that is when considering the modeling most proposed approaches to 
estimate the heritability derive from a likelihood methodology. We can quote for instance the 
REstricted Maximum Likelihood (REML) strategies, originally proposed by [13] and then 
developed in US). Several approximations of the REML algorithm have also been proposed, 
see for instance the software EMMA proposed by m or the software GCTA (|22|,[2T]). 

We proposed in |3] another method based on a maximum likelihood strategy to estimate 
the heritability and implemented in the R package HiLMM. We proved in [3| the following 
theoretical result: though the computation of the likelihood is based on the modeling assump¬ 
tion Q, the estimator is consistent (unbiased) under the less restrictive modeling assumption 
Q. We believe this consistency result remains true for the estimators produced using the 
algorithms REML, EMMA, GCTA. But we also proved that, when q ¥= I, the standard error 
is not the one computed by the softwares when q = 1 and may be very large. We obtained a 
theoretical formula for the asymptotic variance of the estimator (depending in particular on 
q) and conducted several numerical experiments to understand how this asymptotic variance 
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gets larger depending on the various quantities, in particular with respect to q and the ratio 
n/N. We observed that this variance indeed gets larger when q gets smaller, so that the 
accuracy of the heritability estimator is slightly deteriorated when all SNPs are not causal. 
Thus, a first problem is to find a method able to produce an estimator with smaller standard 
error than those obtained using only likelihood strategies. Also, since this standard error 
depends on q, a second problem is to produce a confidence interval one could trust without 
knowing q. 

The goal of this paper is to address both problems. The results we obtained in [1] suggest 
the following. If we knew the set of causal SNPs, then, considering only this (small) subset in 
the genetic information matrix, we would obtain with HiLMM an estimator having a smaller 
standard error than when using all SNPs in the genetic information matrix. Thus, our new 
practical method contains a variable selection step. 

Variable selection and signal detection in high dimensional linear models have been exten¬ 
sively studied in the past decade and there are many papers on this subject. Among them, 
we can quote m and p] about variable selection and references therein. The case of high 
dimensional mixed models has received little attention. As far as variable selection methods 
in the random effects of LMM are concerned, we are only aware of the work of p] and [H]. 
Let us mention that regarding the estimation of heritability with possible sparse effects, there 
is also the bayesian approach of [7] and pS]; which proposes an interesting estimator for the 
heritability but which is computationally very demanding. Notice that, in our framework, we 
are not far from the situation for which it is proved in m that the support cannot be fully 
recovered, which happens when Nqlog{l/q) » n. The variable selection step we propose 
takes elements from both ultrahigh dimension methods (0, i, mi) and classical variable 
selection techniques l|18jL 

The second step of our method is to apply HiLMM using the selected subset of causal 
SNPs produced by the first step. Finally, we propose a non parametric bootstrap procedure 
to get confidence intervals with prescribed coverage. The whole procedure requires only a few 
minutes of computation. 

To conclude, we propose in this paper a very fast method to estimate the heritability and 
construct a confidence interval substantially smaller than without variable selection when the 
genetic effects are very sparse. Since the real genetic architecture is in general unknown in 
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practice, we also propose an empirical criterion which allows the user to decide whether it 
is relevant to apply a variable selection based approach or not. Our method has also the 
advantage to return a list of SNPs possibly involved in the variations of a given quantitative 
feature. This set of SNPs can further be analyzed from a biological point of view. 

The paper is organized as follows. Section describes the data set which motivated our 
work. Section provides the detailed description of the method, and Section displays the 
results of the numerical study. They were obtained by using the R package EstHer that we 
developed and which is available from the Comprehensive R Archive Network (CRAN). The 
simulation results illustrate the performance of our method on simulations and show that it is 
very efficient from a statistical point of view. In Section we provide an empirical criterion 
to help the user to decide whether it is relevant to apply a variable selection based approach 
or not. In Sectionwe propose a thorough comparison of our approach with other methods 
in terms of statistical and numerical performances. Finally, the results obtained on the brain 
data described in Section can be found in Section We also provide a discussion section 
at the end of the paper. 


2. Description of the data 

We worked on data sets provided by the European project Imagen, which is a major study 
on mental health and risk taking behaviour in teenagers. The research program includes 
questionnaires, interviews, behaviour tests, neuroimaging of the brain and genetic analyses. 
We will focus here on the genetic information collected on approximately 2000 teenagers as 
well as measurements of the volume of several features: the intracranial brain volume (icv), 
the thalamus (th), the caudate nucleus (ca), the amygdala (amy), the globus pallidus (pa), 
the putamen (pu), the hippocampus (hip), the nucleus accubens (acc) and the total brain 
volume (bv). Figure which comes from m, is a schematic representation of these different 
areas of the brain. The data set contains n = 2087 individuals and N = 273926 SNPs, as well 
as a set of fixed effects, which in our case are the age (between 12 and 17), the gender and 
the city of residency (London, Nottingham, Dublin, Dresden, Berlin, Hamburg, Mannheim 
and Paris). 

In the following, our goal will thus be to provide a method to estimate the heritability of 
these neuroanatomical features. 
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Figure 1. Different regions of the brain (this figure is taken from |19ji. 

3. Description of the method 

The method that we propose can be split into two main parts: the first one consists in a 
variable selection approach and the second one provides an estimation of the heritability and 
the associated 95% confidence interval which is computed by using non parametric bootstrap. 

At the beginning of this section we shall consider the case where there is no fixed effects, 
that is 


(6) Y = Zu + e 

but we explain at the end of this section how to deal with fixed effects. Let us first describe 
our variable selection method which consists of three steps. 

3.1. Variable selection. Inspired by the ideas of [5], we do not directly apply a Lasso type 
approach since we are in an ultra-high dimension framework. Hence, we start our variable 
selection stage by the SIS (Sure Independence Screening) approach, as suggested by [5], 
in order to select the components of u which are the most correlated to the response Y 
and then we apply a Lasso criterion which depends on a regularization parameter A. This 
regularization parameter is usually chosen by cross validation but here we decided to use the 
stability selection approach devised by m which provided better results in our framework. 
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Step 1: Empirical correlation computation. The first step consists in reducing the number of 
relevant columns of Z by trying to remove those associated to null components in the vector 
u. For this, we use the SIS (Sure Independence Screening) approach proposed by [5] and 
improved by [8] in the ultra-high dimensional framework. More precisely, we compute for 
each column j of Z: 


Cj = 

and we only keep the iVmax columns of Z having the largest Cj. In practice, we choose the 
conservative value A^max = n, inspired by the comments of [5] on the choice of A^max- 

In the sequel, we denote by Zred the matrix containing these n relevant columns. This 
first step is essential for our method. Indeed, on the one hand, it substantially decreases the 
computational burden of our approach and on the other hand, it reduces the size of the data 
and thus makes classical variable selection tools efficient. 


Step 2: LASSO criterion and stability selection. In order to refine the set of columns (or 
components of u) selected in the first step and to remove the remaining null components in 
the vector u, we apply a Lasso criterion originally devised by [18] which has been used in many 
different contexts and has been thouroughly theoretically studied. It consists in minimizing 
with respect to u the following criterion: 

(7) CritA(u) = ||Y - Zred^lli + A||u||i , 

which depends on the parameter A and where HxjH = Xif=i ||x||i = l®*l 

X = {xi,... ,Xp). The choice of the regularization parameter A is crucial since its value 
may strongly affect the selected variables set. Different approaches have been proposed for 
choosing this parameter such as cross-validation which is implemented for instance in the 
glmnet R package. Here we shall use the following strategy based on the stability selection 
proposed by m- 

The vector of observations Y is randomly split into several subsamples of size n/2. For 
each subsample, we apply the LASSO criterion for a fixed parameter A and the selected 
variables are stored. Then, for a given threshold, we keep in the final set of selected variables 
only the variables appearing a number of times larger than this threshold. In practice, we 
generated 50 subsamples of Y and we chose the parameter A as the smallest value of the 
regularization path. As explained in |12] . such a choice of A ensures that some overfitting 
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occurs and hence that the set of selected variables is large enough to include the true variables 
with high probability. 

The matrix Z containing only the hnal set of selected columns will be denoted by Zgnai in 
the following, where Affinal denotes its number of columns. 

The threshold has to be chosen carefully: keeping too many columns in Zgnai could indeed 
lead to overestimating the heritability and, on the contrary, removing too many columns of Z 
could lead to underestimating the heritability. In the “small g” situations where it is relevant 
to use a variable selection approach a range of thresholds in which the heritability estimation 
is stable will appear as suggested by [12] . In practice, we simulate observations Y satisfying 
(©, by using the matrix Z, for different values of q and for different values rf and we observe 
that this stability region for the threshold appear for small values of q. This procedure is 
further illustrated in Section HI 

3.2. Heritability estimation and confidence interval. 

3.2.1. Heritability estimation. For estimating the heritability, we used the approach that we 
proposed in [3|. It is based on a maximum likelihood strategy and was implemented in the R 
package HiLMM. Let us recall how this method works. 

In the case where q = 1, which corresponds to the non sparse case, 

Y ~ AA (o, 7yV*2R + (1 - r/*)cT*2ldMn) , 

with (T*^ = Na^ + (T*^ and R = Z fiv,ai Z ^^,,] /jVfinab where Zgnai denotes the matrix Z in which 
the columns selected in the variable selection step described in Section [3T| are kept. 

Let U be defined as follows; U'U = UU' = IdRn and URU' = diag(Ai,..., A„,), where 
the last quantity denotes the diagonal matrix having its diagonal entries equal to Ai,..., A^. 
Hence, in the case where g = 1, 


(8) Y = U'Y ~ AA(0, T) 


with r = diag(r/*cj*^Ai + (1 — rf)a *‘^,..., rf a*‘^\n + (1 — 


where the Xfs are the eigenvalues of R. 

We propose to define r} as a maximizer of the log-likelihood 

\ 1 


(9) 




n h(Ai - 1) + ly 


2 log (r?(Ai - 1) + 1) , 
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where the l^’s are the components of the vector Y = U'Y. 

We now explain how to obtain accurate confidence intervals for the heritability by using a 
non parametric bootstrap approach. 

3.2.2. Bootstrap confidence interval. We used the following procedure: 

- Step 1: We estimate rj* and cr*^ by using our approach described in the previous 
subsection. The corresponding estimators are denoted ry and a. 

- Step 2: We compute Ynew = fwhere Y is defined in ([^ and f has the same 
structure as T defined in Q except that rj* and a* are replaced by their estimators r) 
and d, respectively. 

- Step 3: We create K vectors (Ynew,i)is;j^Ar from Ynew by randomly choosing each of 
its components among those of Ynew- 

- Step 4: We then build K new vectors (Ysamp,i)is;is:A' as follows: Ysamp.i = fYnew,i- 
For each of them we estimate the heritability. We thus obtain a vector of heritability 
estimators (fyi,..., f}x)- 

- Step 5: For obtaining a 95% bootstrap confidence interval, we order these values of 
fjk and keep the ones corresponding to the [0.975 x K\ largest and the [0.025 x K\ 
smallest, where [xj denotes the integer part of x. These values define the upper 
and lower bounds of the 95% bootstrap confidence interval for the heritability rj*, 
respectively. 

A bootstrap estimator of the variance can be obtained by computing the empirical variance 
estimator of the fyfc’s. In practice, we chose K = 80 replications. 

In Step 2 of the previous algorithm, we should be in the non sparse case q = 1 thanks to 
the variable selection stage. Hence, the covariance matrix of Ynew should be close to identity. 

Observe that our resampling technique is close to the one proposed by [?] for building 
permutation tests in linear mixed models. 

3.3. Additional fixed effects. The method described above does not take into account the 
presence of fixed effects. For dealing with such effects we propose to use the following method, 
which mainly consists in projecting the observations onto the orthogonal of Im(X), the image 
of X, to get rid of the fixed effects. In practice, instead of considering Y and Z we consider 
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Y = A'Y and Z = A'Z, where A is a n x (n — d) matrix (d being the rank of the fixed 
effects matrix), such that AA' = Px, A'A = Id]gn-d and Px = Id]Rn — X(X'X)“^X'. This 
procedure was for instance used by [6]. 


4. Numerical study 

We present in this section the numerical results obtained with our approach which is 
implemented in the R package EstHer. 

4.1. Simulation process. Since in genetic applications, the number n of individuals is very 
small with respect to the number N of SNPs, we chose n = 2000 and N = 100000 in our 
numerical study. We also set = 1, we shall consider different values for q and we shall 
change the value of cr* in order to have the following values for rj*: 0.4, 0.5, 0.6 and 0.7. We 
generate a matrix W such that its columns Wj are independent binomial random variables of 
parameters n and pj, where pj is randomly chosen in [0.1,0.5]. We compute Z by centering 
and empirically normalizing the matrix W. The random effects are generated according to 
Equation Q and then we compute a vector of observations such that Y = Zu + e. 

We can make two important comments about the previous simulation process. Eirstly, we 
generated a matrix W with independent columns, that is we assume that the SNPs are not 
correlated. Since this assumption may not be very realistic in practice, we provide in Section 


4.2.5 some additional simulations where the generated matrix W has been replaced by the 
real matrix W coming from the IMAGEN project. Secondly, we did not include fixed effects 
but we show some results in Section [4.2.41 when fixed effects are taken into account. 


4.2. Results in very sparse scenarios. In this section, we shall focus on the performances 
of our method in a very sparse scenario, that is 100 causal SNPs out of 100,000. We will 
describe all the results in terms of heritability estimation, support recovery and computational 
times in this particular case, then we will study other sparsity scenarios. 


4.2.1. Choice of the threshold. In order to determine the threshold, we apply the procedure 
described in Section ^ and 3.2. 1[ Eigure displays the mean of the absolute value of the 
difference between rj* and the estimated value fj for different thresholds and for different 
values of ry* obtained from 10 replications. We can see from this figure that in the case 
where the number of causal SNPs is relatively small; 100, that is q = 10“^, our estimation 
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procedure provides relevant estimations of the heritability for a range of thresholds around 
0.75. Moreover, the optimal threshold leading to the smallest gap between i) for different 
values of ij* is 0.76. We will use this value in the following numerical study. However, the 
way of choosing the threshold will be further discussed, especially in the section dedicated to 
the study of the genetic data. 


Figure 2. Absolute difference between r]* and f) for thresholds from 0.6 to 
0.9 and for q = 10~^ (100 causal SNPs). 



4.2.2. Confidence intervals. We use the non parametric boostrap approach described in Sec¬ 
tion in order to compute the confidence intervals associated to the estimations of the her¬ 
itability. Table shows that the 95% confidence intervals obtained by bootstrap and the 
empirical confidence intervals are very similar. The empirical confidence intervals are com¬ 
puted as follows: the different estimations of rj* obtained along the different replications are 
ordered, the [0.975 x M\ largest and the [0.025 x M\ smallest values correspond to the upper 
(resp. lower) bound of the 95% empirical confidence interval. Here, \x\ denotes the integer 
part of X and M is the number of replications. From Table we can see that the empirical 
confidence intervals are included in the bootstrap intervals, which means that our approach 
provides conservative intervals. 

4.2.3. Comparison between the methods with and without selection. Our results are compared 
to those obtained if we do not perform the selection before the estimation, that is with the 
method implemented in HiLMM (’’without”), but also with an approach which assumes the 
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Table 1. 95 % confidence intervals for fj obtained empirically and by our 
Bootstrap method. 


ri* 

0.4 

0.5 

0.6 

0.7 

Bootstrap 

[0.353 ; 0.503] 

[0.413 ; 0.565] 

[0.494 ; 0.654] 

[0.596 ; 0.738] 

Empirical 

[0.391 ; 0.470] 

[0.449 ; 0.542] 

[0.496 ; 0.645] 

[0.618 ; 0.720] 


Table 2. 95 % confidence intervals for fj obtained by our approach, GCTA, 
the oracle approach and the approach without selection (“without”). 


r?* 

0.4 

0.5 

0.6 

0.7 

EstHer 

[0.353 ; 0.503] 

[0.413 ; 0.565] 

[0.494 ; 0.654] 

[0.596 ; 0.738] 

Oracle 

[0.362 ; 0.472] 

[0.414 ; 0.563] 

[0.529 ; 0.670] 

[0.619 ; 0.745] 

without 

[0.120 ; 0.880] 

[0.102 ; 0.812] 

[0.320 ; 0.938] 

[0.349 ; 0.932] 


position of the non null components to be known (oracle). The results are displayed in Figure 
and in Table In this table, the confidence intervals displayed for the lines ” Oracle” and 
’’without” are obtained by using the asymptotic variance derived in [1] which corresponds to 
the classical inverse of the Fisher information in the case q = 1- We observe that our method 
without the selection step provides similar results, that is almost no bias but a very large 
variance due to the framework iV » n. Our method EstHer considerably reduces the variance 
compared to this method and exhibits performances close to those of the oracle approach 
which, contrary to our approach, knows the position of the non null components. 


4.2.4. Additional fixed effects. We generated some synthetic data according to the process 


described in Section 4.1 but we added a matrix of fixed effects containing two colums. Figure 
(a) displays the corresponding results which show that the presence of fixed effects does not 
alter the heritability estimation. 


4.2.5. Simulations with the matrix W of the IMAGEN data set. We conducted some addi¬ 
tional simulations in order to see the impact of the linkage disequilibrium, that is the possible 
correlations between the columns of Z. Indeed, in the previous numerical study, we generated 
a matrix W with independent columns. The matrix W that we use now to generate the 
observations is the one from our genetic data set, except that we truncated it in order to have 
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(a) 


(b) 



1—r 


(c) 


(d) 


Figure 3. Estimation of the heritability and the corresponding 95% confi¬ 
dence intervals when q =10“^, and for different values of r]* : (a) rj* = 0.4, 
(b) rj* = 0.5, (c) rj* = 0.6, (d) rj* = 0.7. The means of the heritability esti¬ 
mators (displayed with black dots), the means of the lower and upper bounds 
of the 95% confidence intervals are obtained from 20 replicated data sets for 
the different methods: without selection (“without”), “oracle” which knows 
the position of the null components and EstHer. The horizontal gray line 
corresponds to the value of rj*. 


n = 2000 and N = 100000. The results of this additional study are presented in Figure]^ (b). 
We can see that they are similar to those obtained previously in Figure which means that 
our method does not seem to be sensitive to the presence of correlation between the columns 

of W. 
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“1-1-n 

0.5 0.6 0.7 


(a) (b) 

Figure 4. Estimated value of the heritability with 95 % confidence intervals. 

The results are displayed for several values of rj*: 0.5, 0.6 and 0.7. (a) The 
data sets were generated including fixed effects, (b) The matrix Z used to 
generate data sets comes from the IMAGEN data. The black dots correspond 
to the mean of fj over 10 replications and the crosses are the real value of rj*. 

4.2.6. Computational times. The implementation that we propose in the R package EstHer is 
very efficient since it only takes 45 seconds for estimating the heritability and 300 additional 
seconds to compute the associated 95% confidence interval. These results have been obtained 
with a computer having the following configuration: RAM 32 GB, GPU 4 x 2.3 GHz. 

4.2.7. Recovering the support. When the number of causal SNPs is reasonably small, our 
variable selection method is efficient to estimate the heritability and we wonder if it is reliable 
as well to recover the support of the random effects. In Figure we see the proportion 
of support estimated by our method when there are 100 causal SNPs: our method selects 
around 130 components. We then focus on the proportion of the real support which has been 
captured by our method: we see that it may change according to g*. Indeed, the higher i]*, 
the higher this proportion. Nevertheless, even in the worst case, that is rf = 0.5, Figure 
shows that even if we keep only 30% of the real non null components, we select the most 
active ones. 
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(a) (b) 

Figure 5. (a) Boxplots of the length of the set of selected variables with 
EstHer for 40 repetitions. The real number of non null components is 100. (b) 
Boxplots of the proportion of the real non null components captured in the 
set of selected variables. 


The ability of recovering the support in linear models has been studied by [20] in ultra high 
dimensional cases. The author shows that with a non null probability, the support cannot be 
estimated under some numerical conditions on the parameters q, N and n (namely if there are 
considerably more variables N than observations n, and if the number of non null components 
qN is relatively high). In this simulation study, even when we consider small values of q (for 
instance q = 10“^, that is 100 causal SNPs), we are not far from to the ultra high dimensional 
framework described in m. which can explain the difficulties to recover the full support. 


4.3. Results when the number of causal SNPs is high. In subsection 4.2 we show the 
performance of our method in the case where the proportion of causal SNPs q is small, that 
is around 10“^. In this subsection, we focus on a more polygenic scenario, that includes the 
cases where thousands of SNPs or ten of thousands of SNPs are causal. 


4.3.1. Results when there are SNPs with moderate and weak effects. We first focus on the 
statistical performance of EstHer when there are a lot of SNPs (1000 or 10000) with small 
effects (for example, that explain 5% of the phenotypic variations), and a small number 
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Figure 6. Barplots of the proportion of components found by our method 
as function of the most efficient variables. For example, the first bar is the 
proportion of the 10 % higher components that we captured with our selection 
method. The histograms are displayed for several values of t]*: 0.5 (a), 0.6 
(b), 0.7 (c). 

(around 100) with moderate effects. We can see from Figure that, in this case, EstHer 
provides unbiased estimations with a small variance. 

4.3.2. Results when all SNPs have moderate effeets. If all causal SNPs have moderate effects 
and if the number of these causal SNPs is high, namely greater than 1000, EstHer underes¬ 
timates the heritability. These results are displayed in Eigure Moreover, we can see from 
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r]* = 0.4 T]* = 0.6 




Figure 7. Results of HiLMM and EstHer when there are a few causal SNPs 
with moderate effects and a lot of SNPs with small effects. The proportion of 
each is 100 out of 1000 (up) and 100 out of 10000 (bottom), with rf = 0.4 and 

0 . 6 . 

Figure that there is no threshold choice that can provide accurate estimations of heritability 
for all values of rj*. 

5. A CRITERION TO DECIDE WHETHER WE SHOULD APPLY ESTHeR OR HiLMM 

On the one hand, we observed that applying HiLMM provides unbiased estimations of 
the heritability, no matter the number of causal SNPs. However, the main drawback of this 
estimator is its very large variance. On the other hand, if the number of causal SNPs is 
not too high, EstHer provides unbiased estimations of the heritability with standard errors 
substantially smaller than HiLMM. However, if the number of causal SNPs is high, EstHer 
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r]* = 0.4 T]* = 0.6 




Figure 8 . Results of HiLMM and EstHer for 1000 (up) and 10000 (bottom) 
causal SNPs and for r]* = 0.4 and 0.6. 


underestimates the heritability. These observations are similar to those made by [23], who 
built an hybrid estimator able to deal with both sparse and non sparse scenario, to which 
we will compare our approach in Section]^ Therefore, we propose hereafter a rule to decide 
whether it is better to apply EstHer or HiLMM. We can see from Figure]^ that when there 
are 100 causal SNPs, there is a large range of threshold values which provide an accurate 
estimation of r]*, but when there are 1000 or 10000 causal SNPs, see Figure]^, the estimations 
are very different even for close thresholds. This observation gave us the idea of quantifying 
the stability of the estimations around the threshold that we determined as the optimal 
one. More precisely, for each threshold, we have an estimation of heritability with a 95% 
confidence interval, and we count the number of thresholds for which the confidence intervals 


overlap. Figure 10 confirms the stability around the best threshold for different values of rj* 















IMPROVING HERITABILITY ESTIMATION BY A VARIABLE SELECTION APPROACH IN SPARSE HIGH DIMENSIONAL LIIS 




1000 causal SNPs 10000 causal SNPs 

Figure 9. Absolute difference {rj* — r]\ for thresholds from 0.6 to 0.9 and for 
1000 (left) and 10000 (right) causal SNPs. 

Table 3. Mean value of the number of overlapping confidence intervals for 
16 thresholds from 0.7 to 0.85. 


V* 

100 causal SNPs 

1000 causal SNPs 

10000 causal SNPs 

0.4 

11.9 

8.1 

7.8 

0.5 

15.3 

6.9 

7 

0.6 

16 

9.2 

7.1 


and Table [^displays the number of ovelapping confidence intervals. We empirically determine 
the following criterion: if the mean number of thresholds is greater than 10 (over 16 tested 
thresholds), we apply EstHer, if not, we apply HiLMM. The results obtained by using this 
criterion are displayed in Figure [TT| 

6. Results after applying the decision criterion and comparison to other 

METHODS 


6.1. Statistical performances. In this section we show the results obtained after applying 
the criterion described in Sectionj^ We compare these results to those obtained using HiLMM, 
but also with the software GEMMA described in [21]. GEMMA can fit both a non sparse 
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rj* = 0.4 r]* = 0.5 






T]* = 0.6 



Figure 10. Estimation of the heritability with 95% conhdence intervals for r]* 
from 0.4 to 0.6 (from left to right), and from 100, 1000 and 10000 causal SNPs 
from top to bottom. Each graph shows the heritability estimations with 95% 
confidence intervals computed with HiLMM (“without”) and for thresholds 
between 0.7 and 0.85. 


linear mixed model (GEMMA-LMM) and a sparse linear mixed model if the BSLMM option 
is chosen denoted by BSLMM in the sequel. As explained in [23], BSLMM can deal with very 
sparse and also with very polygenic scenarios. 
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We can see from the bottom part of Figure 11 that, in very polygenic scenarios {q = 0.1, 
namely 10,000 causal SNPs), all the methods provide similar results: the four estimators are 
indeed empirically unbiased, but with a very large variance. 

In sparse scenarios {q = 10“^, namely 100 causal SNPs), we can see from the top part of 
Figure that EstHer provides better results than HiLMM and GEMMA-LMM which exhibit 
similar statistical performances. In sparse scenarios, the variance of the BSLMM estimator 
is larger than the one provided by EstHer and smaller than the one provided by GEMMA- 
LMM and HiLMM. However, the performances of BSLMM could perhaps be improved by 
changing the MGMC parameters. Here, for computational time reasons, we used the default 
parameters that is 100,000 and 1,000,000 for the number of burn-in steps and the number of 
sampling, respectively. 


6.2. Computational times. The computational times in seconds for one estimation of the 
heritability with BSLMM and the heritability estimation for 16 thresholds as well as the 
associated confidence intervals with our method EstHer are displayed in EigurefT^ We chose 
this number of thresholds since we applied the criterion defined in Section It should be 
noticed that the computational times for EstHer could be reduced by diminishing the number 
of thresholds. Eor BSLMM we used the default parameters for the number of burn-in steps 
and the number of sampling. We can see from this hgure that the gap between EstHer and 
BSLMM is all the more important that N is large. Contrary to our approach, BSLMM seems 
to be very sensitive in terms of computational time to the value of N. 


7. Applications to genetic data 


In this section, we applied our method to the neuroanatomic data coming from the Imagen 
project. In this data set, n = 2087 individuals and N = 273926 SNPs. Eor further details on 
this data set, we refer the reader to Section 


7.1. Calibration of the threshold. We start by finding the threshold which is the most 
adapted to the Imagen data set. We use the same technique as the one described in Section 
4.2. 1[ for several values of rj* and several thresholds, we display the absolute value of rj* — r). 


see Figure [T^ The only difference with Section 4.2.1 is that we generated the observations by 
using the matrix W coming from the IMAGEN data set. According to Figure [T^ we can find 
a reliable range of thresholds for estimating the heritability for all rj* from 0.4 to 0.7 when 
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Esther BSLMM HiLMM GEMMA-LMM 



Esther BSLMM HiLMM GEMMA-LMM 



Esther BSLMM HiLMM GEMMA-LMM 


Figure 11. Estimations of r) with 95 % confidence intervals obtained using 
EstHer, BSLMM, HiLMM and GEMMA-LMM with 100 causal SNPs (top) and 
10,000 causal SNPs (bottom). The results are obtained with 10 replications. 


the number of causal SNPs is smaller than 100. This optimal threshold is equal to 0.79. We 
shall use this value in the sequel. 

7.2. Application of the decision criterion. Since we determined in the previous section 
that the optimal threshold is 0.79, we apply EstHer for thresholds around this value, that is 
from 0.7 to 0.85. We then count the number of overlapping confidence intervals, as explained 
in Section The results are displayed in Table We observe from this table that the 
sensitivity to the choice of the threshold varies substantially from one phenotype to another. 
Hence, we choose to apply our EstHer approach to the most stable phenotypes with respect 
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50000 100000 200000 


Figure 12. Times (in seconds) to compute one heritability estimation with 
BSLMM (crosses) and EstHer (dots) by using 16 thresholds for n = 2000 and 
different values of N from 50, 000 to 200,000. 



Figure 13. Absolute value of the difference between ij* and r) for thresholds 
from 0.6 to 0.9, and for different values of qN: (a) 50 causal SNPs, (b) 100 
causal SNPs. Each difference has been computed as the mean of 10 replica¬ 
tions. 


to our criterion, namely pa, amy and acc. For the other phenotypes we recommand to apply 
HiLMM or another similar approach such as GCTA or GEMMA-LMM. 
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Table 4. Mean value of the number of overlapping confidence intervals for 
16 thresholds from 0.7 to 0.85. 


Phenotype 

Number of thresholds 

Bv 

7.19 

Hip 

7.5 

Icv 

7.37 

Acc 

9.94 

Amy 

9.88 

Th 

7.5 

Ca 

7.13 

Pu 

7.13 

Pa 

10.75 



(a) (b) 

Figure 14. (a) Heritability estimations of bv, icv, th, pu, pa, hip, amy, acc, 
and ca with 95% confidence intervals obtained using EstHer or HiLMM ac¬ 
cording to the outcome of our decision criterion, (b) Heritability estimations 
of bv, icv, th, pu, pa, hip, amy, acc and ca with 95% confidence intervals 
obtained using HiLMM. 
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7.3. Results. Figure [l4| (a) shows the heritability estimation with 95 % conhdence intervals 
for all phenotypes, using either EstHer or HiLMM according to the outcome of our decision 


criterion. Figure 14 (b) shows the results obtained by using HiLMM, namely without any 
variable selection step. We compare our results with the ones obtained by m who estimated 
the heritability of the same phenotypes by using the software GCTA. On the one hand, we can 
see from Figure [T4| that in the cases where EstHer is used the conhdence intervals given by our 
methodology are substantially smaller and included in those provided by either HiLMM or 
|19j . On the other hand, when HiLMM is used our results are on a par with those obtained by 
|19j . Moreover, our approach provides a list of SNPs which may contribute to the variations 
of a given phenotype and which could be further analyzed from a biological point of view in 
order to identify new biological pathways. 


8. Conclusion 

We propose in this paper a practical method to estimate the heritability in sparse linear 
mixed models using variable selection tools, as well as conhdence interval obtained thanks to 
a non parametric bootstrap approach. Our approach is implemented in the R package EstHer 
which is available from the Comprehensive R Archive Network (CRAN) and from the web 
page of the hrst author. In the course of this study, we showed that our approach has two 
main features which makes it very attractive. Firstly, it is very efficient from a statistical 
point of view since it provides conhdence intervals considerably smaller than those obtained 
with methods without variable selection. Secondly, its very low computational burden makes 
its use feasible on very large data sets coming from quantitative genetics. 

Moreover, we observed that the statistical performance of the EstHer approach are all the 
more impressive that the level of sparsity is high that is when q is small. For this reason, we 
also proposed an empirical criterion which allows the user to decide whether it is better to 
apply an approach that takes into account the sparsity and starts with a variable selection 
stage, namely EstHer, or an approach which ignores the potential sparsity in the observations, 
namely HiLMM. 
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